Abstract. As we gain experience with parallel file systems, it becomes
increasingly clear that a single solution does not suit all applications. For 
example, it appears to be impossible to find a single appropriate interface, 
caching policy, file structure, or disk-management strategy. Furthermore, 
the proliferation of file-system interfaces and abstractions makes applica-
tions difficult to port.

We propose that the traditional functionality of parallel file systems be 
separated into two components: a fixed core that is standard on all plat-
forms, encapsulating only primitive abstractions and interfaces, and a set
of high-level libraries to provide a variety of abstractions and application- 
programmer interfaces (APIs). 

We present our current and next-generation file systems as examples of 
this structure. Their features, such as a three-dimensional file structure, 
strided read and write interfaces, and I/O-node programs, are specifically
designed with the flexibility and performance necessary to support a wide 
range of applications. 


1 Introduction 

Scientific applications are increasingly dependent on multiprocessor computers 
to satisfy their computational needs. Many scientific applications, however, also 
use tremendous amounts of data [11]: input data collected from satellites or seis- 
mic experiments, checkpointing output, and visualization output. Worse, some 
applications manipulate data sets too large to fit in main memory, requiring 
either explicit or implicit virtual memory support. The I/O system becomes the 
bottleneck in all of these applications, a bottleneck that is worsening as processor 
speeds continue to improve more rapidly than disk speeds. 

Fortunately, it is now possible to configure most parallel systems with suf-
ficient I/O hardware [22]. Most of today’s parallel computers interconnect tens
or hundreds of processor nodes, each of which has a processor and memory, with
a high-speed network. Nodes with attached disks are usually reserved as I/O
nodes, while applications run on some cluster of the remaining compute nodes.

In the past few years, many parallel file systems have been described in 
the literature, including Bridge/PFS [12], CFS [35], nCUBE [9], OSF/PFS [38], 
sfs [27], Vesta/PIOFS [6], HFS [25], PIOUS [30], RAMA [29], PPFS [19], Scotch [15], 
and Galley [31, 32]. Many more techniques for improving the performance of 
parallel file systems have been described, including caching and prefetching [24,
23, 34], two-phase I/O [10], disk-directed I/O [20], compute-node caching [37],
chunking [40], compression [41], filtering [21, 2], and so forth.

The diversity of current systems and techniques indicates that there is clearly 
no consensus about the structure of, interface to, or even functionality of parallel 
file systems. Indeed, it seems that no one interface or structure will be appro- 
priate for all parallel applications; for maximum performance, flexibility of the 
underlying system is critical [25]. It is important that applications be able to 
choose the interface and policies that work best for them, and for application 
programmers to have control over I/O [46, 8]. 

This diversity of current systems, particularly of the application-programmer’s 
interface (API), also makes it difficult to write portable applications. Nearly ev- 
ery file system mentioned above has its own API. A standard interface is being 
developed, MPI-IO [5], but even that interface is appropriate only for a certain 
class of applications. 


2 Solution 

We believe that flexibility is needed for performance. An application programmer 
should be able to choose the interfaces and abstractions that work best for that 
application. To be practical, however, these interfaces and abstractions should 
be available on all platforms, so the application is portable, and each platform 
should support multiple interfaces and abstractions, so the platform is usable by 
many applications. 

Consider Figure 1. Most traditional parallel file-system solutions attempt to 
provide a common file system that hopes to fit all applications. This common 
“core” file system is fixed, in that it must be used by all applications accessing 
parallel files. 1 To increase flexibility, we propose to move much of the function- 
ality out of the core and into application libraries. Our new Galley Parallel File 
System takes this “RISC”-like approach.

The new core file system provides only a minimal set of services, leaving 
higher-level interfaces, semantics, and functionality to application-selectable li- 
braries. While the implementation of the core is platform dependent, and pro- 
vided by the platform vendor, its interface is standard across all platforms. This 
approach has proven successful with the MPI message-passing standard [28].


1 We avoid the term “kernel,” as the core may be comprised of user-level libraries,
server daemons, and kernel code. 





Fig. 1. Our proposed evolution of parallel file-system structure. Traditional systems 
depend on a fixed “core” file system that attempts to serve all applications. In our 
Galley File System, we shrink the core to leave the API and many of the parallel features 
to an application-selectable library. In our next-generation Galley2 File System, we 
shrink the core further to allow user-selected code to run on the I/O nodes. 


Application programmers may then choose from a variety of different lan- 
guages and libraries, to select one that best fits the application’s needs. Some 
languages or libraries would provide a traditional read-write abstraction; others
(probably with compiler support) would provide transparent out-of-core data 
structures; still others may provide persistent objects. Some libraries may be 
designed for particular application classes like computational chemistry [13] or 
to support a particular language [7, 4]. Finally, some compilers and program- 
mers may choose to generate application-specific code using the core interface 
directly. 

The concept of I/O libraries is not new; the C stdio library and the C++ 
iostreams library are common examples, both layered above the “core” kernel 
interface. Yet few parallel file systems have been designed specifically to support 
a variety of high-level libraries. The difficulty is in deciding how to divide fea- 
tures between the core and the application libraries, and then in designing an 
appropriate core interface. In our research to explore this issue, we are building 
two generations of file systems. In the first, Galley, we investigate the underlying 
file abstraction, a low-level read/write interface, and resource-scheduling alter-
natives. In the second, with the tentative name Galley2, we go a step further
and allow user code to run on the I/O nodes. The next two sections discuss each 
file system in more detail. 


3 The Galley Parallel File System 

Our current parallel file system, Galley [31, 32], looks like Figure 1b. A more
detailed picture is shown in Figure 2. The core file system includes servers that 
run on the I/O nodes and a tiny interface library that runs on the compute nodes. 
The I/O-node servers manage file-system metadata, I/O-node caching, and disk
scheduling. The interface library translates library calls into messages to servers
on the I/O nodes and arranges the movement of data between compute and I/O 
nodes. The higher-level application library, if any, is responsible for providing a 
convenient API, data declustering, file-access semantics, and any compute-node 
caching. 





Fig. 2. The structure of the Galley parallel file system includes a tiny interface library 
on the compute node, which coordinates communication between application I/O li- 
braries on the compute nodes and servers on the I/O nodes. 


Galley’s servers provide a unified global file-name space. Each file is actually 
a collection of subfiles, each of which resides entirely on one I/O node. Each
subfile is itself a collection of one or more named forks. Each fork is a sequence 
of bytes, the traditional file abstraction. Galley’s core file system provides no 
automatic data declustering; a library may choose to stripe data across subfiles, 
for example. 
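
As an illustration only, the file/subfile/fork hierarchy and a library-level striping rule might be sketched as follows. The class names, the one-subfile-per-I/O-node layout, and the round-robin `locate` function are our assumptions for this sketch; they are not Galley's actual interface, and a library is free to choose any other declustering.

```python
from dataclasses import dataclass, field


@dataclass
class Subfile:
    """One subfile resides entirely on a single I/O node and holds named forks."""
    io_node: int
    forks: dict = field(default_factory=dict)  # fork name -> bytearray


@dataclass
class ParallelFile:
    """A file is a collection of subfiles (here, one per I/O node by assumption)."""
    name: str
    subfiles: list


def make_file(name, num_io_nodes):
    """Create a file with one empty subfile on each I/O node."""
    return ParallelFile(name, [Subfile(io_node=i) for i in range(num_io_nodes)])


def locate(offset, stripe_unit, num_subfiles):
    """Library-level round-robin striping (hypothetical policy).

    Maps a logical byte offset to (subfile index, offset within that
    subfile's data fork). The core itself performs no such mapping.
    """
    stripe = offset // stripe_unit
    subfile = stripe % num_subfiles
    local = (stripe // num_subfiles) * stripe_unit + offset % stripe_unit
    return subfile, local
```

With a stripe unit of 4 bytes and 3 subfiles, logical byte 13 falls in the fourth stripe, which wraps around to subfile 0 at local offset 5.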

Galley’s forks are specifically designed to support libraries. In particular, 
some libraries may wish to store metadata in one or more forks of the subfile, 
with data in other forks. The traditional approach is to place the metadata 
in an auxiliary file or in a “header” at the beginning of the data. The former 
approach makes file management awkward, as there is more than one file name 
involved in a single data set. The latter approach makes it difficult to access the 
file through multiple libraries, each of which expects its own header, and can 
complicate declustering calculations. In Galley each library can add its own fork 
to the subfiles, containing its own metadata. 

The structure of parallel files, beyond the fact that they are collections of local 
files, is completely determined by library code. Multiple applications wishing
to use the same parallel files must maintain a mutually agreed structure, by
convention. 

In an extensive characterization of parallel scientific applications [33], we 
found that many applications access files in small pieces, typically in a regular 
“strided” pattern. To allow application libraries to support these patterns effi- 
ciently, the Galley interface supports both structured (e.g., strided and nested 
strided) and unstructured read and write requests. This interface leads to dra- 
matically better performance [32]. 
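
To make the access patterns concrete, the sketch below expands a strided and a nested-strided request into the (offset, length) extents they denote. The function names and parameter order are hypothetical and are not taken from the Galley interface; the point is only that one structured request can stand for many small accesses.

```python
def strided(offset, record_size, stride, count):
    """Extents of a simple-strided request: `count` records of `record_size`
    bytes whose starting offsets are `stride` bytes apart."""
    return [(offset + i * stride, record_size) for i in range(count)]


def nested_strided(offset, record_size, inner_stride, inner_count,
                   outer_stride, outer_count):
    """Extents of a nested-strided request: a simple-strided pattern that is
    itself repeated `outer_count` times, `outer_stride` bytes apart."""
    extents = []
    for j in range(outer_count):
        extents.extend(strided(offset + j * outer_stride, record_size,
                               inner_stride, inner_count))
    return extents


def read_extents(fork, extents):
    """Gather all requested extents from a fork (a bytes object in this
    sketch) in a single call rather than one call per record."""
    return b"".join(fork[o:o + n] for o, n in extents)
```

A library that would otherwise issue one small read per record can instead describe the whole pattern once and let the servers gather the pieces.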

Galley’s features, including the global name space, three-dimensional file 
structure, and structured read and write requests, make it a suitable and ef- 
ficient base for constructing parallel file systems, much more so than building 
directly on distributed Unix systems. 

More information about Galley is available on the WWW 2 and in forthcom- 
ing papers [31, 32]. 

4 The Galley2 Parallel File System 

Our next-generation file system, which we so far call “Galley2” for lack of a better 
name, goes beyond Galley to allow application control over I/O-node activities. 
We keep the same three-dimensional file structure of subfiles and forks, and we 
keep the global name space, but we otherwise reduce the core file system to a 
minimal local file system on each I/O node, and allow application-supplied code 
to run on the I/O nodes (see Figure 1c). Indeed, we expect that an I/O node
would have an active process (or thread) for each application with files on that 
I/O node. Figure 3 gives a more detailed picture of this structure. 

This structure breaks away from the traditional client-server structure to 
allow for “programmable” servers. A fixed, common server always forces design- 
ers to choose between specific high-level services that may not fit the needs of 
all applications, and primitive low-level operations that permit flexibility in the 
clients but at the cost of extensive client-server communications. Galley makes 
a reasonable choice here, but (for example) uses a fixed caching policy. 

In Galley2 the core file system is extremely simple: there is no caching,
prefetching, or remote access. It provides a (local) interface to open, close, read
and write forks through a block-level interface, and it arbitrates among I/O-
node programs competing for processor time, memory, disk access, and network
access. In short, it focuses on the shared aspects of the file system.
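
A minimal sketch of what such a local, block-level core interface could look like is given below. The class name, method names, block size, and in-memory storage are all assumptions made for illustration; resource arbitration is omitted entirely, and nothing here describes an actual Galley2 interface.

```python
BLOCK_SIZE = 4096  # assumed block size for this sketch


class LocalCore:
    """Hypothetical per-I/O-node core: open, close, and block-level read and
    write of forks. No caching, prefetching, or remote access."""

    def __init__(self):
        self._forks = {}   # (subfile, fork name) -> {block number: bytes}
        self._open = {}    # handle -> (subfile, fork name)
        self._next = 0

    def open(self, subfile, fork):
        """Open (creating if needed) a fork of a local subfile."""
        key = (subfile, fork)
        self._forks.setdefault(key, {})
        handle = self._next
        self._next += 1
        self._open[handle] = key
        return handle

    def close(self, handle):
        """Release a handle; later use of it raises KeyError."""
        del self._open[handle]

    def write_block(self, handle, block_no, data):
        """Store exactly one block; partial blocks are left to libraries."""
        if len(data) != BLOCK_SIZE:
            raise ValueError("data must be exactly one block")
        self._forks[self._open[handle]][block_no] = bytes(data)

    def read_block(self, handle, block_no):
        """Return one block; unwritten blocks read as zeros."""
        return self._forks[self._open[handle]].get(block_no, bytes(BLOCK_SIZE))
```

Everything above this level (byte-granularity access, caching, declustering, and communication with compute nodes) would be supplied by application-selected code.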

Thus, Galley2 applications can choose nearly all features of the parallel file
system, including the API, caching, prefetching, declustering, inter-node com- 
munication protocols, synchronization and consistency, and so forth. Again, we 
expect most applications to choose from pre-defined libraries, but we also en- 
courage use of application-specific code written by application programmers, 
generated automatically by compilers, or generated at run time [36]. We refer to 
all of these choices as “application-selected code.” 


2 http://www.cs.dartmouth.edu/~nils/galley.html

Fig. 3. The structure of the Galley2 parallel file system depends on application I/O
libraries that have components on both the compute and I/O nodes. The I/O-node 
servers shrink down to simple I/O managers that arbitrate resources among the local 
user-selected library modules. 


There are many reasons to allow application-selected code on the I/O node. 
Application-specific optimizations can be applied to I/O-node caching and prefetch- 
ing. Mechanisms like disk-directed I/O [20] can be implemented, using application- 
specific data-distribution information. File data can be distributed among mem- 
ories according to a data-dependent mapping function, for example, in applica- 
tions with a data-dependent decomposition of unstructured data [21]. Incoming 
data can be filtered in a data-dependent way, passing only the necessary data 
on to the compute node, saving network bandwidth and compute-node mem- 
ory [21, 2]. Blocks can be moved directly between I/O nodes, for example, to 
rearrange blocks between disks during a copy or permutation operation, without 
passing through compute nodes. Format conversion, compression, and decom- 
pression are also possible. In short, there are many ways that we can optimize 
memory and disk activity at the I/O node, and reduce disk and network traffic, 
by moving what is essentially application code to run at the I/O node in addition 
to the compute nodes. 
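
As one example of the filtering case, an I/O-node program could scan local records and forward only those that the application needs. The record layout (a 32-bit integer key followed by a 64-bit float) and the function name are assumptions for this sketch, not part of any Galley2 interface.

```python
import struct

# Assumed fixed-size record layout: little-endian int32 key, float64 value.
RECORD = struct.Struct("<id")


def io_node_filter(fork_bytes, predicate):
    """Intended to run on the I/O node: scan the local records and return
    only those accepted by the application-supplied predicate, so that the
    rejected records never cross the network."""
    out = bytearray()
    for key, value in RECORD.iter_unpack(fork_bytes):
        if predicate(key, value):
            out += RECORD.pack(key, value)
    return bytes(out)
```

If only a small fraction of the records satisfy the predicate, the data sent to the compute node shrinks by roughly the same fraction, at the cost of running the predicate on the I/O node's processor.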

Although it would be feasible to use a Unix file system as the local file system, 
the semantics and interface are not appropriate for the highest performance. In 
particular, the Unix file-system interface does not give the applications enough 
control, would have no global name space, and has an inefficient copy-based 
interface. 



5 Research directions 


The success of our design clearly depends on the ability of the I/O-node oper- 
ating system to efficiently manage its resources while providing the necessary 
functionality. We are exploring the following issues: 

- resource management: how should the I/O node manage its shared resources 
in the presence of competing applications? The result must be a tradeoff 
between overall system throughput and individual application performance. 
Traditional uniprocessor policies do not directly apply to this distributed 
situation; local resource decisions can have a disproportionate global impact 
on performance. 

- physical memory allocation: how should we best allocate physical memory 
among I/O-node programs? 

- processor scheduling: how shall we schedule the CPU among I/O-node pro- 
grams? What about applications that choose to move some non-I/O-related 
computation to the I/O node? 

- disk transfers: what is an appropriate interface for requesting I/O to and 
from buffers? 

- message-passing: what is the best interface for I/O-node programs to com- 
municate with the compute nodes, and with each other? 

- What is the appropriate mechanism to support I/O-node programs? We are 
considering three alternatives: processes, threads within a safe language like 
Java [16] or Python 3 , and threads running sandboxed code [45]. There are 
three primary issues in this consideration: 

1. how is the I/O-node manager protected from I/O-node programs? With 
normal hardware protection, in the case of processes; with type-safe lan- 
guages like Java; or with sandboxing. 

2. how is the code loaded onto the I/O node? Presumably it can be
loaded from disk in the same way as the compute-node code. The tricky 
part might be dynamic linking of sandboxed code. 

3. what is the overhead? 

6 Related work 

The Hurricane File System (HFS) [25], a parallel file system for the Hector mul- 
tiprocessor, is also designed with the philosophy that flexibility is critical for 
performance. Indeed, their results clearly demonstrate the tremendous perfor- 
mance impact of choosing the right file structure and management policies for 
the application’s access pattern. HFS is actually a collection of building-block 
objects that can be plugged together differently according to application needs. 
For example, some building blocks distribute data across multiple disks, others 
provide prefetching policies, and others define an API. HFS allows the program-
mer to replace or extend application-level building blocks, but these do not
include the objects that control declustering, replication, parity, or other server-
side attributes. Galley permits, but does not enforce, a building-block approach
to library design; other approaches are possible. Finally, the Hurricane operating
system does not dedicate nodes to I/O, so it is not unusual for application code
to run on “I/O” nodes.


3 http://www.python.org/

The Portable Parallel File System (PPFS) [19] is a testbed for experimenting 
with parallel file-system issues. It includes many alternative policies for declus- 
tering, caching, prefetching, and consistency control, and allows application pro- 
grammers to select appropriate policies for their needs. It also supports user- 
defined declustering patterns through an upcall function. Unlike Galley, however, 
there is no clearly defined lower-level interface to which programmers may write 
new high-level libraries. Unlike Galley2, it does not allow application-selected 
code (beyond that already included in PPFS) to execute on the I/O nodes. 

In the Transparent Informed Prefetching (TIP) system [34] an application 
provides a set of hints about its future accesses to the file system. The file 
system uses these hints to make intelligent caching and prefetching decisions. 
While this technique can lead to better performance through better prefetching, 
it only affects prefetching and caching behavior. It is possible to provide “hints 
that disclose,” in their words, for other aspects of the system, but it is unclear 
that these hints can provide the same amount of flexibility offered by Galley and 
Galley2.

All three of these systems provide the application programmer some control 
over the parallel file system, primarily by selecting existing policies from the 
built-in alternatives. 

Galley2 promotes the use of application-selected code on the I/O nodes. Sev-
eral operating systems can download user code into the kernel [14, 26, 1]. Other
researchers have noted that it is useful to move the function to the data rather 
than to move the data to the function [3, 42, 17]. Some distributed database 
systems execute part of the SQL query in the server rather than the client, to 
reduce client-server traffic [2]. Hatcher and Quinn hint that allowing user code 
to run on nCUBE I/O nodes would be a good idea [18]. 


7 Status 

Galley runs on the IBM SP-2 and on workstation clusters [31], and has so far 
been extremely successful [32]. We have ported several application libraries on 
top of Galley, including a traditional striped-file library, Panda [39, 43], Vesta [6], 
and SOLAR [44]. We are also using Galley to investigate policies for managing 
multi-application workloads. 

We are building a simulator for Galley2, to evaluate some of the key ideas, 
and a full implementation, to experiment with real applications. There is no 
question that it will be a much more flexible system than Galley and its prede- 
cessors. We will declare success if that flexibility provides better performance on a 
wider range of applications. That will occur if the benefits of application-specific
I/O-node programs outweigh the cost of the extension mechanism (sandboxing,
context switching, or interpretation). We are optimistic!

More information about our research can be found at 

http://www.cs.dartmouth.edu/research/pario.html